feat: pipeline accepts vep.tar.gz or vep/ dir#105
Conversation
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughThe PR adds support for accepting Ensembl VEP cache as either a pre-extracted directory or a 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
main.nf (1)
134-134: ⚡ Quick win
ensemblvep_cachetake parameter is declared but never used.The
FULCRUMGENOMICS_TWISTCGPworkflow acceptsensemblvep_cacheas atake:input (line 134), but the new if/else block readsparams.ensemblvep_cachedirectly and never references the channel parameter. This makes the take parameter dead code and couples the inner workflow to global params, reducing reusability.Consider either: (a) using the
ensemblvep_cachechannel parameter in the if/else logic and removing the directparamsaccess, or (b) removing theensemblvep_cachetake parameter entirely and updating the caller at line 97.Also applies to: 170-181
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@main.nf` at line 134, The workflow FULCRUMGENOMICS_TWISTCGP declares an input channel ensemblvep_cache but the if/else logic reads params.ensemblvep_cache directly, making the channel unused; update the conditional and downstream uses to consume the ensemblvep_cache channel instead of params.ensemblvep_cache (e.g. use ensemblvep_cache.first() or check ensemblvep_cache.empty? as appropriate) so the workflow is driven by its input channel, and apply the same change to the duplicate logic around the block that mirrors lines 170-181; alternatively, if you prefer the global param, remove the ensemblvep_cache take: declaration and update callers accordingly—pick one approach and make the code consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@main.nf`:
- Around line 171-175: The UNTAR_VEP_CACHE process is receiving a single-element
list because `.collect()` wraps the 2-element tuple emitted by
Channel.fromPath(...).map { [[id: 'vep_cache'], it] } into a list; remove the
`.collect()` so the channel supplies a destructurable 2-element tuple to
UNTAR_VEP_CACHE. In short: change the code that builds the input channel for
UNTAR_VEP_CACHE (the Channel.fromPath(params.ensemblvep_cache).map { [[id:
'vep_cache'], it] } expression) to omit `.collect()` so UNTAR_VEP_CACHE receives
tuple val(meta), path(archive) as expected.
In `@modules/nf-core/untar/main.nf`:
- Around line 53-62: The single-top-level-dir branch currently writes files
using the original archive paths so ${prefix} stays empty; change the loop
handling when the test on ${archive} is true to strip the first path component
from each archive entry and create files/dirs under ${prefix} (i.e., derive a
variable like stripped=$(echo "${i}" | sed -E 's#^[^/]+/##') and then use
${prefix}/$stripped for mkdir -p and touch), ensuring directories and files are
created inside ${prefix} rather than at the archive root; update the branch that
iterates over tar -tf ${archive} to use this stripped path logic for both files
and directories.
---
Nitpick comments:
In `@main.nf`:
- Line 134: The workflow FULCRUMGENOMICS_TWISTCGP declares an input channel
ensemblvep_cache but the if/else logic reads params.ensemblvep_cache directly,
making the channel unused; update the conditional and downstream uses to consume
the ensemblvep_cache channel instead of params.ensemblvep_cache (e.g. use
ensemblvep_cache.first() or check ensemblvep_cache.empty? as appropriate) so the
workflow is driven by its input channel, and apply the same change to the
duplicate logic around the block that mirrors lines 170-181; alternatively, if
you prefer the global param, remove the ensemblvep_cache take: declaration and
update callers accordingly—pick one approach and make the code consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 991e27e3-1833-42cf-a8da-2b8298feee49
⛔ Files ignored due to path filters (1)
modules/nf-core/untar/tests/main.nf.test.snapis excluded by!**/*.snap
📒 Files selected for processing (8)
docs/variant_annotation.mdmain.nfmodules.jsonmodules/nf-core/untar/environment.ymlmodules/nf-core/untar/main.nfmodules/nf-core/untar/meta.ymlmodules/nf-core/untar/tests/main.nf.testnextflow_schema.json
6b09a9b to
e0939bf
Compare
| // MODULE: PERBASE | ||
| // | ||
| PERBASE(ALIGNBAM.out.bam_bai, ch_fasta.join(ch_fasta_fai).first()) | ||
| PERBASE(ALIGNBAM.out.bam_bai, ch_fasta.join(ch_fasta_fai)) |
| channel.fromPath(params.ensemblvep_cache) | ||
| .map { it -> [[id: 'vep_cache'], it] } | ||
| .collect() | ||
| ) |
There was a problem hiding this comment.
Coderabbit is correct on the bug, but its solution leaves this as a queue channel and so is wrong in a different way.
| channel.fromPath(params.ensemblvep_cache) | |
| .map { it -> [[id: 'vep_cache'], it] } | |
| .collect() | |
| ) | |
| channel.fromPath(params.ensemblvep_cache) | |
| .collect { it -> [[id: 'vep_cache'], it] } | |
| ) |
or
| channel.fromPath(params.ensemblvep_cache) | |
| .map { it -> [[id: 'vep_cache'], it] } | |
| .collect() | |
| ) | |
| channel.value(file(params.ensemblvep_cache)) | |
| .map { it -> [[id: 'vep_cache'], it] } | |
| ) |
| .map { it -> [[id: 'vep_cache'], it] } | ||
| .collect() | ||
| ) | ||
| ch_vep_cache = UNTAR_VEP_CACHE.out.cache.collect() |
There was a problem hiding this comment.
| ch_vep_cache = UNTAR_VEP_CACHE.out.cache.collect() | |
| ch_vep_cache = UNTAR_VEP_CACHE.out.cache.first() |
Same idea as above.
However, I think nextflow will actually implicitly preserve value channel status if all the inputs to a process are value channels. I don't see that documented anywhere though so we shouldn't rely on it.
process VALUE_IN_VALUE_OUT {
input:
tuple val(meta), path(some_file)
output:
tuple val(meta), path("output.txt"), emit: output
script:
"""
touch output.txt
"""
}
workflow {
ch = channel.fromPath("somefile.txt")
.collect { it -> [[id: "file"], it] }
print(ch)
ch.view()
VALUE_IN_VALUE_OUT(ch)
ch2 = VALUE_IN_VALUE_OUT.out.output
print(ch2)
ch2.view()
}DataflowVariable(value=null)
DataflowVariable(value=null)
executor > local (1)
[fb/41b205] VALUE_IN_VALUE_OUT | 1 of 1 ✔
[['id':'file'], /.../somefile.txt]
[[id:file], /.../work/fb/41b2054051393a8611a540b98dd9d5/output.txt]
| ch_vep_cache = UNTAR_VEP_CACHE.out.cache.collect() | ||
| } else { | ||
| ch_vep_cache = params.ensemblvep_cache | ||
| ? channel.fromPath(params.ensemblvep_cache).map { it -> [[id: 'vep_cache'], it] }.collect() |
|
I ran this branch with both an uncompressed directory as well as a vep.tar.gz, here's how the cache path resolves: tarbelldirectory |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
main.nf (1)
173-174: ⚡ Quick winUse consistent pattern for directory paths.
The tarball branch (line 168-169) uses
channel.value(file(...)).map {...}while the directory branch useschannel.fromPath(...).collect {...}. Both produce value channels but the pattern inconsistency is confusing. Also,fromPathis designed for glob patterns; for directory paths,file()wrapped inchannel.value()is more explicit.♻️ Align with tarball branch pattern
ch_vep_cache = params.ensemblvep_cache - ? channel.fromPath(params.ensemblvep_cache).collect { it -> [[id: 'vep_cache'], it] } + ? channel.value(file(params.ensemblvep_cache)).map { it -> [[id: 'vep_cache'], it] } : PREPARE_ANNOTATION_DB.out.ensemblvep_cache🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@main.nf` around lines 173 - 174, ch_vep_cache uses channel.fromPath(...) for the directory branch while the tarball branch uses channel.value(file(...)).map(...); change the directory branch to the same explicit pattern by wrapping params.ensemblvep_cache with file(...) and channel.value(...) and using .map to produce the [[id:'vep_cache'], it] tuple so both branches use the same value-channel approach (refer to ch_vep_cache and params.ensemblvep_cache).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modules/local/untar_vep_cache/main.nf`:
- Around line 22-25: After extracting ${archive} into _tmp, validate that there
is exactly one top-level entry before running mv: list entries in _tmp (the
current code's top_level capture), count them, and if the count != 1 emit an
error (including the unexpected entries) and exit non-zero; only when count == 1
proceed to set top_level and mv "_tmp/${top_level}" cache. Reference the
existing variables/commands top_level, _tmp, ${archive}, tar and mv to locate
where to add this validation.
---
Nitpick comments:
In `@main.nf`:
- Around line 173-174: ch_vep_cache uses channel.fromPath(...) for the directory
branch while the tarball branch uses channel.value(file(...)).map(...); change
the directory branch to the same explicit pattern by wrapping
params.ensemblvep_cache with file(...) and channel.value(...) and using .map to
produce the [[id:'vep_cache'], it] tuple so both branches use the same
value-channel approach (refer to ch_vep_cache and params.ensemblvep_cache).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dcdcf3da-ff65-4283-9be8-ca4b82e0dab5
📒 Files selected for processing (8)
CHANGELOG.mdconf/modules.configdocs/variant_annotation.mdmain.nfmodules/local/untar_vep_cache/environment.ymlmodules/local/untar_vep_cache/main.nfmodules/local/untar_vep_cache/meta.ymlnextflow_schema.json
✅ Files skipped from review due to trivial changes (4)
- CHANGELOG.md
- modules/local/untar_vep_cache/environment.yml
- modules/local/untar_vep_cache/meta.yml
- docs/variant_annotation.md
🚧 Files skipped from review as they are similar to previous changes (1)
- nextflow_schema.json
| mkdir -p _tmp | ||
| tar -xzf ${archive} -C _tmp | ||
| top_level=\$(ls _tmp | head -1) | ||
| mv "_tmp/\${top_level}" cache |
There was a problem hiding this comment.
Validate exactly one top-level entry after extraction.
The script assumes the tarball contains exactly one top-level directory. If the archive has multiple entries, only the first is moved; others are silently lost. If zero entries or unexpected structure, mv fails.
🛡️ Add validation before moving
mkdir -p _tmp
tar -xzf ${archive} -C _tmp
+entry_count=\$(ls -A _tmp | wc -l)
+if [ \$entry_count -ne 1 ]; then
+ echo "Error: Expected exactly one top-level entry in tarball, found \$entry_count" >&2
+ exit 1
+fi
top_level=\$(ls _tmp | head -1)
mv "_tmp/\${top_level}" cache📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| mkdir -p _tmp | |
| tar -xzf ${archive} -C _tmp | |
| top_level=\$(ls _tmp | head -1) | |
| mv "_tmp/\${top_level}" cache | |
| mkdir -p _tmp | |
| tar -xzf ${archive} -C _tmp | |
| entry_count=\$(ls -A _tmp | wc -l) | |
| if [ \$entry_count -ne 1 ]; then | |
| echo "Error: Expected exactly one top-level entry in tarball, found \$entry_count" >&2 | |
| exit 1 | |
| fi | |
| top_level=\$(ls _tmp | head -1) | |
| mv "_tmp/\${top_level}" cache |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modules/local/untar_vep_cache/main.nf` around lines 22 - 25, After extracting
${archive} into _tmp, validate that there is exactly one top-level entry before
running mv: list entries in _tmp (the current code's top_level capture), count
them, and if the count != 1 emit an error (including the unexpected entries) and
exit non-zero; only when count == 1 proceed to set top_level and mv
"_tmp/${top_level}" cache. Reference the existing variables/commands top_level,
_tmp, ${archive}, tar and mv to locate where to add this validation.
f369e2e to
406e13b
Compare
Closes #96.
The reason we can't use the nf-core module
UNTARis because it applies--strip-components 1when extracting, which strips the top-level directory from the archive. VEP requires the species/ subdirectory to be present under the cache root. We don't runVEPas part of the pipeline tests, I picked up on this with Claude.